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ABSTRACT 

Analytic performance models are essential for understanding the 
performance characteristics of loop kernels, which consume a ma¬ 
jor part of CPU cycles in computational science. Starting from a 
validated performance model one can infer the relevant hardware 
bottlenecks and promising optimization opportunities. Unfortu¬ 
nately, analytic performance modeling is often tedious even for ex¬ 
perienced developers since it requires in-depth knowledge about 
the hardware and how it interacts with the software. We present the 
“Kerncraft” tool, which eases the construction of analytic perfor¬ 
mance models for streaming kernels and stencil loop nests. Starting 
from the loop source code, the problem size, and a description of 
the underlying hardware, Kerncraft can ideally predict the single¬ 
core performance and scaling behavior of loops on multicore pro¬ 
cessors using the Roofline or the Execution-Cache-Memory (ECM) 
model. We describe the operating principles of Kerncraft with its 
capabilities and limitations, and we show how it may be used to 
quickly gain insights by accelerated analytic modeling. 

1. INTRODUCTION 

This paper is concerned with analytic performance modeling on the 
CPU core and chip level. We define an analytic performance model 
as a simplified mathematical description of the interaction between 
program code (“software”) and hardware, constructed with the goal 
in mind to understand the dominating bottlenecks in program ex¬ 
ecution performance. The fundamental resources required for the 
execution of any program are instructions and data', thus, in order 
to predict the time it takes to execute a program one must have 
models for how instructions are executed in the CPU core(s) and 
how data travels through the memory hierarchy. It should be em¬ 
phasized that the more accurate a model the more specific to a CPU 
architecture it will be. However, even simple analytic models can 
lead to surprising and useful insights. 

A simple, very optimistic model for instruction execution would 
assume perfect out-of-order processing, perfect pipelining, and no 
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dependencies, with all required data residing in the LI cache. In¬ 
structions would be distributed among the suitable pipelines, and 
the pipeline that takes the longest time to execute its instructions 
would determine the runtime of the code. We term this model the 
throughput model (TP). If dependencies along the critical path are 
taken into account one arrives at a more pessimistic model, the crit¬ 
ical path model (CP). In reality, and especially for loop-based codes 
where some “steady state” execution may be assumed, the actual 
runtime will be in between the TP and CP models as long as no 
other bottlenecks apply, such as instruction cache misses, instruc¬ 
tion throughput limitations, or data transfers beyond the LI cache. 

The Roofline model (see Se ct.|2.2| > and the Execution-Cache-Mem¬ 
ory (ECM) model (see Sect. |2.3| l~are two useful performance mod¬ 
els for steady-state loop codes that build on an instruction execution 
model but also take data transfers into account. Both models may 
require considerable effort and experience to construct in compli¬ 
cated cases (e.g., when the code is composed of many loops, or 
when a loop has a complicated structure). In this work we demon¬ 
strate a software tool, Kerncraft, that can automatically construct 
these models from a C code formulation and a suitable hardware 
description without actually executing the code. We restrict our¬ 
selves to streaming and stencil loop nests, where the data access 
patterns can be statically determined at compile time. It is not our 
intention to assess the usefulness or accuracy of the models them¬ 
selves, although some examples will be given that highlight their 
capabilities. 

Supporting tools are employed to determine parameters that are re¬ 
quired as model input in the machine description. We use the LIK- 
WID tool suite eg for most of these tasks: The machine topology, 
i.e., information about core and cache sharing, ccNUMA struc¬ 
ture, cache sizes, etc., is extracted from the output of likwid- 
topology. Achievable bandwidths to caches and main memory 
are measured with the likwid-bench tool |20| , since it provides a 
controlled and compiler-independent environment for building tai¬ 
lored benchmark loops. 

Any analytic performance model must be checked for validity by 
comparing its predictions with measurements on the target hard¬ 
ware. The validation of predictions with measurements is an inte¬ 
gral part of the Kerncraft tool. 

This paper is organized as follows: In Sect. [2] we briefly de¬ 
scribe the components of the performance models (in-core model. 
Roofline, and ECM) supported by Kerncraft. Section|^introduces 
the hardware and software used for all experiments. Details about 
the structure of the Kerncraft tool and its concrete implementation 


Listing 1: Scalar product in double precision 

double a[], b[], s = 0.; 
for(i=0; 1<N; ++i) 
s += a [i] * b[i] ; 


are given in Sect.|^ In Sect. we evaluate the tool using streaming 
and stencil loop codes, and Sect.[7]gives a summary and an outlook 
to future work. 

The current version of Kerncraft is available for download at 
https://github.com/RRZE-HPC/kerncraft 

2. BACKGROUND 

In this section we briefly describe the required components for 
steady-state loop performance modeling: the in-core execution 
model, the Roofline model, the ECM model, and approaches for 
model validation. 

2.1 In-core execution modeling 

In simple cases the throughput and critical path modeling described 
above can be done by hand. Listing shows the source code for a 
scalar product in double precision. On a CPU with a three-stage 
ADD pipeline, the (scalar) naive code, i.e., without any unrolling 
and modulo variable expansion, would be limited by the depen¬ 
dency on the summation variable s. Hence, the core would execute 
one floating-point operation (flop) every 3 cy, which is also the CP 
prediction for one loop iteration. The TP prediction would be 2cy 
if the core can execute one LOAD, one ADD, and one MULT in¬ 
struction per cycle. With appropriate unrolling and modulo variable 
expansion this limit can be achieved in practice if no other bottle¬ 
necks apply. 

With more complicated code the pencil-and-paper analysis be¬ 
comes tedious and error-prone. One solution, albeit limited to 
Intel architectures, is the Intel Architecture Code analyzer (lACA) 
(H) which provides TP and CP predictions for assembly code 
sections. In Kerncraft we use the lACA TP output (and optionally 
the CP output) as the in-core component in the ECM and Roofline 
models. lACA also provides information about port utilization and 
points out a front-end bottleneck in case the code would need more 
concurrency than what is supported by the instruction decoders. 

2.2 Roofline model 

The Roofline model is optimistic by design, i.e., it yields an ab¬ 
solute lower execution time limit for a loop. The ideas behind the 
model have been in use since the 1980s (16|[^[^, but it was pop¬ 
ularized (and got its current name) by Williams et al. in 2009 (231 . 
It is based on the assumption that performance is limited either by 
data transfers from a certain memory hierarchy level or by the com¬ 
putational (arithmetic) work, whichever takes longer. This implies 
that all data transfers perfectly overlap with computation. In order 
to apply the model, the data volume from and to each memory hi¬ 
erarchy level k, jSj., needs to be assessed and put in relation to the 
achievable peak bandwidth Bj^ of that level. Their ratio Tj^ = 
determines the data transfer time for level k. To have a formula¬ 
tion that is, as far as possible, independent of the clock frequency 
setting, bandwidths are best given in bytes per clock cycle. 

In a naive approach the time required for computation can be cal¬ 


culated by dividing the work <fi (usually flops, but any other well- 
defined metric will do) by the applicable computational peak per¬ 
formance Ppeak of the code at hand: T^ore = (p/Ppeak (again, Rpeak 
may be given in flops/cy for a frequency-independent analysis). 
However, an in-core execution model as described in Sect. HD is 
usually more accurate. The final execution time prediction T is 
then T^-oof = maxj. (Tcore, ?(:)• No substantial changes are required 
to adapt the model for multiple cores; the achievable bandwidth 
Bkiji) on n cores in memory level k must of course be determined 
by measurements, and the data traffic and code execution charac¬ 
teristics must take changes implied by parallelization into account 
(e.g., if data is shared among multiple cores using the outer-level 
cache). 

2.3 ECM model 

The Execution-Memory-Cache (ECM) model requires the same in¬ 
formation about the kernel code as the Roofline model. In contrast 
to Roofline, it drops the assumption of a single bottleneck: Trans¬ 
fers of data through the memory hierarchy are serialized across the 
hierarchy levels of a core and therefore contribute to the reduc¬ 
tion of the total performance without overlapping with one another. 
The bandwidths associated with each cache level are not taken 
from benchmarks, which would also include contributions from all 
higher (closer to registers) cache levels, but from published docu¬ 
mentation by the vendors. The only measured input is the saturated 
maximum bandwidth of the memory interface. Since a cache line 
is the “atomic” data package in the system, it is convenient to for¬ 
mulate the model with a certain “unit of work” in mind, usually a 
number of iterations that leads to a small integer number of cache 
line transfers (this pertains to the Roofline model as well). 

On a machine with three cache levels, there are five contributions 
to the total single-core performance. We write them in a compact 
notation (see ]18| for details): 

{Tol II TnOL I Tklkl I 7l2L3 I 7l3MEm} • 

Tol and TnOL are the overlapping and non-overlapping times that 
come out of the in-core model (see Sect.|^for details on how these 
numbers are determined on a specific architecture). The non-over- 
lapping contribution is serialized with the times for the data trans¬ 
fers between adjacent memory levels Ti^\k 2 , 7l 2L35 and Tl 3MEM! 
whereas Tol overlaps with all data transfers. Note that the data 
transfer times are not identical to the Tk in the Roofline model, 
since they strictly measure the time for getting data from one mem¬ 
ory level to the next; the Tk, on the other hand, quantify the full 
transfer time from level k into the registers. A runtime prediction 
for a data set in memory would be 

7ECM,Mein = max (ToL, TnOL + rLlL2 + rL2L3 + rL3Mem) • 

For summarizing predictions for data in different memory levels 
we use the following notation: 

{ TeCM.LI 1 7eCM,L2 1 7eCM,L 3 1 TECM.Mem} • 

In contrast to the Roofline model, the ECM model is not opti¬ 
mistic by design, since it drops the single-bottleneck view. In prac¬ 
tice, however, it is rare that performance measurements exceed the 
model’s predictions. 

For multicore scaling, the model assumes perfect scalability until a 
bandwidth bottleneck (usually the main memory bandwidth) is hit. 
It thus predicts the number of cores where the loop performance 
ceases to scale: nj = rECM.Mem/rL3Mem- At that point, the ECM 




model prediction is identical to the bandwidth-based Roofline pre¬ 
diction. See |18| for more validations of the ECM model. 

2.4 Model validation 

A performance model should be validated for correctness by run¬ 
ning the code on the target hardware. Different levels of validation 
are possible. In the simplest case, the measured runtime or per¬ 
formance is compared with the prediction. Even if those numbers 
agree, it may still be the case that several errors cancel out and 
the agreement is purely accidental. Performance counter measure¬ 
ments can be used to verify quantities that come out of the model 
beyond the pure runtime, e.g., transferred data volume, memory 
bandwidth utilized, cache misses, etc. 

In many cases it is particularly useful to vary problem and execu¬ 
tion parameters such as problem sizes, core counts, floating-point 
precision, etc., in the model verification process, so that the model 
can be checked at many points in this configuration space. This also 
leads to an improved understanding of the inherent bottlenecks by 
identifying typical performance patterns ED- 

3. TESTBED 

We choose two recent Intel CPU architectures for showing the 
modeling results obtained with Kemcraft: Sandy Bridge and 
Haswell. See Table for a summary of documented hardware 
properties. Note that measured inputs for the Roofline and ECM 
models are not included for brevity. They are contained in the 
machine description files distributed with the Kerncraft tool [Tj. 

The Haswell system was configured to use “Cluster on Die” (CoD) 
mode. In CoD, one socket is effectively split into two ccNUMA 
domains (with seven cores each in our case), with two memory con¬ 
trollers per domain and separate L3 caches. The single-core ECM 
model uses the saturated memory bandwidth of one ccNUMA do¬ 
main for TLaMem- In addition, we have deactivated “Uncore clock 
frequency scaling,” a power saving feature that reduces the clock 
speed of the Uncore part of the chip (L3 cache and memory con¬ 
trollers) when only one or two cores are active. All measurements 
were done with the clock speed fixed to the base frequency (no 
“Turbo Mode”) on both systems. None of our benchmarks showed 
the documented core slowdown feature (reduced base clock speed 
with highly efficient vectorized code) on Haswell (7). 

The software components used were the Intel compiler in ver¬ 
sion 15, LIKWID 4.0, Python 2.7, and lACA 2.1. 

4. THE KERNCRAFT TOOL AND SUPPORT¬ 
ING SOFTWARE 

4.1 Overall architecture 

Kerncraft Q combines static source code analysis with static as¬ 
sembly analysis and cache access simulation to yield performance 
predictions using the ECM and Roofline models for steady-state 
loop kernels. The kernel is specified using the C language with 
some restrictions (see Sect. |4.3| l. Static code analysis yields infor¬ 
mation about the number of flops, the data structures, and the data 
volumes in the different memory hierarchy levels. The source is 
compiled and the machine code is handed to lACA, which calcu¬ 
lates an optimistic throughput (TP) prediction for the loop body 
execution time. In addition, Kemcraft requires a hardware descrip¬ 
tion file with architectural properties. With data transfers and in- 
core predictions in place, all information is put together to yield 


Microarchitecture 

SandyBridge-EP 

Haswell-EP 

Abbreviation 

SNB 

HSW 

Xeon model name 

E5-2680 

E5-2695 v3 

Clock (fixed) 

2.7 GHz 

2.3 GHz 

Cores/Threads 

8/16 

2x7/28 

AVX throughput 

1LD & ^ ST 

2LD& 1ST 

scalar throughput 

2LD 1LD& 1ST 

2LD& 1ST 

ADD throughput 

1/cy 

1/cy 

MUL throughput 

1/cy 

2/cy 

FMA3 throughput 

unsupported 

2/cy 

L1-L2 bandwidth 

32B/cy 

64B/cy 

L2-L3 bandwidth 

32B/cy 

32B/cy 

L3-MEM bw. (peak) 

51.2GB/S 

68.3 GB/s 


Table 1: Testbed hardware specifications and relevant microarchi¬ 
tecture features. Note that the theoretical memory bandwidth is not 
used for modeling. 


Roofline and ECM predictions. For a general overview of the Kem¬ 
craft analysis steps see Eig.[T] Details about the different steps are 
given in the following sections. 

The ECM and Roofline analysis can be executed on any machine 
with the required software installed, since the code itself is not ac¬ 
tually run. An optional “Benchmark” mode executes the compiled 
kernel code to measure the performance rather than predicting it, 
thereby validating the model(s). 

In its current version the tool makes certain assumptions about 
the hardware and software that go beyond those required for the 
Roofline and ECM models: a perfect LRU cache replacement strat¬ 
egy and fully associative, fully inclusive, and write-allocate/write- 
back caches. These fit best to current Intel CPUs, but can be lifted 
to adapt the framework to other architectures. Furthermore, the 
compiler is assumed to not alter the data access characteristics of 
the “as-is” loop source code by, e.g., automatic blocking. Unless 
the tool is integrated in a compiler, this restriction is hard to cir¬ 
cumvent. 

Kerncraft and its support scripts are implemented in Python 2.7 and 
are available on the Python Package Index (PyPI). 

4.2 Hardware description 

The hardware description file contains information about the mi¬ 
croarchitecture, the system architecture (e.g., number of cores and 
sockets), and microbenchmark results. It can be created by hand 
or with the help of the support script likwid_auto_bench.py, 
which automatically creates a formatted YAML file by col¬ 
lecting the required information using likwid-topology and 
likwid-bench on the machine it is executed on. It performs mi¬ 
crobenchmark runs with several streaming kernels in all memory 
hierarchy levels and with all possible numbers of cores and threads. 

In addition to the automatically gathered information, cache trans¬ 
fer speeds, compiler flags, and the categorization of execution ports 
into overlapping and non-overlapping, among others, need to be 
manually added to the generated file. 

Hardware information is required for all Kerncraft analyses, but it 
can be gathered beforehand, on the targeted machine, and reused 
for analysis runs on any other system. The hardware description 










user-input 

Intermediate kernel code constants 

Output #define N 1000 

#de£ine M 2000 

for (j=l ; j < N-1; ++j) 
for(i=l; i < M -1; ++i) 
b[j] [i] = (a[ j ] [i-1] + a[ j ] [i+1] 

+ a[j -1] [ i ] + atj+1] [ i ] )* s; 



Figure 1: Overview of Kemcraft analysis 


Listing 2: Hardware description for Intel Xeon E5-2680 Sandy 
Bridge with 2 Sockets 

clock : 2.7 GHz 

cores per socket : 8 

model type: Intel Core SandyBridge EP processor 

model name: Intel Xeon CPU E5-2680 @ 2.70GHz 

sockets: 2 

threads per core: 2 

cacheline size : 64 B 

icc architecture flags: [-xAVX] 

micro-architecture : SNB 

FLOPS per cycle: 

SP: {total: 16, ADD: 8, MUL: 8} 

DP: {total: 8, ADD: 4, MUL: 4} 
overlapping ports: 

["0", "ODV", "1", "2", "3", "4", "5"] 

non-overlapping ports: ["2D", "3D"] 

memory hierarchy: 

- {level: LI, cores per group: 1, 

cycles per cacheline transfer: 2, 
groups: 16, bandwidth: null, 
size per group: 32.00 kB, 
threads per group: 2} 

# [... continues with description of L2 and L3] 

- {level: MEM, cores per group: 8, 

cycles per cacheline transfer: null, 
groups: 1, bandwidth: null, 
size per group : null , 
threads per group: 16} 

benchmarks: 
kernels : 
copy : 

FLOPS per iteration: 0 
read streams: 

{bytes: 8.00 B, streams: 1} 

read+write streams: 

{bytes: 0.00 B, streams: 0} 

write streams: 

{bytes: 8.00 B, streams: 1} 

# [... descriptions of daxpy, load, triad 

# and update follow...] 
measurements: 

# [... bandwidth measurements follow ...] 


files for the architectures used in this paper are distributed with the 
Kerncraft framework. Listing [^contains the hardware description 
file for the Sandy Bridge machine described in Table[^ we left out 
the benchmark results and L2 and L3 cache descriptions for brevity. 

4.3 Code input 

The source code that is subject to the analysis must be provided 
in a separate file. The required syntax is based on the ISO C99 
standard (ID- Constants (e.g., problem sizes) can be passed on 
the command line. To simplify the analysis, there are currently 
some restrictions on the full C99 standard. For instance, array dec¬ 
larations may only use fixed sizes or constants, with an optional 
addition or subtraction of an integer (e.g., double u [N] [M-h 3] [N- 
2] [5], but not double u [M*N]), and array indices must use a loop 
index variable (with optional addition or subtraction), constants, or 
fixed integers. A full list of limitations may be found in the soft¬ 
ware documentation. Listing shows a Jacobi kernel code with 
constants M and N defining the problem size in the two dimensions. 


The pycparser Q Python library is used for parsing the code. The 
resulting abstract syntax tree (AST) is then validated in regard to 
the mentioned restrictions and statically analyzed. We gather the 
following information: loop stack, data sources and destinations, 
and floating point operations (flop). The loop stack contains infor¬ 
mation about all for loops in the code: their order, index variable 
name, start value, end value, and step size. See Table for the 
loop stack information gathered from Listing Data sources and 
destinations are retrieved from the statements in the innermost for 
loop. Any data access can be either direct (e.g., to a scalar or an 
array with constant or fixed index) or relative with an optional off¬ 
set. Multidimensional arrays can have mixed accesses; for instance. 


index variable 

start 

end 

step size 

j 

1 

499 

-Hi 

i 

1 

4999 

-Hi 


Table 2: Loop stack of Listingj^with N=5000 and M=500 

























Listing 3: 2D 5-point Jacobi kernel code 

double a[M] [N] ; 
double b[M][N]; 
double s; 

forCint j=l: ++j) 

forCint 1=1; i<N-l; ++i) 

b[j] [i] = (a[j] [i-1] + a[j] [i + 1] 

+a[j-l][i] + a[j+l][i]) * s; 

}}} 


variable 

dimension 

2"‘^ dimension 

a 

relative j 

relative i — 1 


relative y 

relative i + 1 


relative y — 1 

relative i 


relative y + 1 

relative i 

s 

direct 


Table 3: Data sources of Listingj^with N=5000 and M=500 

xy [0] [j] [i+1] is a direct access on the first dimension, relative 
on the second, and relative with an offset of +1 on the third. See 
Tables[^and|^for the data access analysis of Listing]^ 

Finally the floating point operation count of the inner loop body is 
extracted, for cases where a simple in-core model is required (e.g, if 
lACA is unavailable). Since this is based on the plain source code, 
compiler optimizations such as common subexpression elimination 
or compile-time evaluation, as well as more intricate dependencies 
on the in-core parallelism are ignored, and can lead to inaccuracies. 

4.4 In-core prediction 

We use the Intel Architecture Code Analyzer (lACA) to statically 
analyze the in-core performance of the kernel code. It is freely 
available and provides accurate throughput and critical path pre¬ 
dictions for code execution under the assumption that all code and 
data comes from the LI caches. Its predictions are based on doc¬ 
umented as well as unreleased information about Intel CPUs from 
Nehalem to Haswell. The drawback is that older microarchitectures 
and architectures by other vendors are not supported. 

lACA operates on binaries and therefore requires the code to be 
compiled. To do so, the kernel code is transformed by inserting it 
into a main function and adding boilerplate code to handle declara¬ 
tions and constants specified on the command line. All array decla¬ 
rations are replaced by pointer declarations with heap allocation us¬ 
ing _mni_malloc in order to have arrays that are aligned to 32-byte 
boundaries. The use of heap memory prevents problems with stack 
size constraints. The loop stack is left untouched, but the inner 
loop’s statements are changed to reflect the changed declarations, 
by replacing multi-dimensional indexes with single-dimension in¬ 
dex arithmetic. To prevent the compiler from eliminating parts of 
the benchmark code by optimization, calls to external dummy func¬ 
tions are inserted. The resulting code for the 2D-5pt Jacobi kernel 

variable | D* dimension 2"^^ dimension 
b I relative j relative i ~ 

Table 4: Data destinations of Listingj^with N=5000 and M=500 


Listing 4: Runnable code (shortened) generated by Kemcraftfor 
lACA analysis with dummy function calls. 

#include <stdlib.h> 

void dummy (double *, ... ); 

extern int var_false; 

int mainCint argc, char **argv) { 
const int N = atoi (argv [2] ) ; 
const int M = atoi (argv [ 1] ) ; 

double *a = _mm_malloc( 

(sIzeof(double)) * (M * N), 32); 

for (int i=0; i<(M*N); ++i) a[i] = 0.23; 

II [. . .] allocate and init b [] 
double s = 0.23; 

if (var_false) dummy(a, b, &s); 

asmC'nop"); II to limit search in assembly 
for (int j=l; j<(M-l); ++j) 

for (int i=l; i<(N-l); ++i) 

b[i+(j*N)] = (a[(i-l)+(j*N)] 

+ a[(i+l)+(j*N)] 

+ a[i+((j-l)*N)] 

+ a[i+((j+1)*N)]) * s; 

asmC'nop"); II to limit search in assembly 

} 


is shown in Listing]^ 

To narrow down the throughput analysis to the relevant loop sec¬ 
tion, lACA needs to find special markers in the machine code. 
These are inserted by our tool, modifying the assembly produced by 
the compiler. There is an option to insert markers as in-line assem¬ 
bly in the C source, but this interferes with compiler optimizations. 
Finding the correct block in the assembly code is done by selecting 
the block with the most vector instructions. The user can always 
fall back to an interactive procedure by providing a command line 
switch (—asm-block=manual) to Kerncraft. 

The unrolling factor is extracted from the marked section by ana¬ 
lyzing the loop counter increments used at the end of the selected 
block. This is needed to allow interpretation of the throughput 
analysis results from lACA. For example, if a loop is four times 
unrolled and lACA predicts a throughput of 32 cy then one single 
loop iteration and update step takes 8 cy. 

lACA provides the number of cycles, per execution port, that one 
unrolled iteration (i.e., one assembly code loop body) takes. In 
case of the 2D 5-point Jacobi kernel on a Sandy Bridge microar¬ 
chitecture, it takes 16cy in the data portion of ports 2 and 3 (“2D” 
and “3D”). The maximum cycles of the data portions in ports 2 
and 3 makes up the non-overlapping contribution, thus one might 
conclude that it would be 16 cy. However, one unrolled iteration 
encompasses two “work packages” (cache line lengths), so the ap¬ 
plicable non-overlapping time is only TnOL = 8cy. The specific 
port numbers that are responsible for non-overlapping execution 
are defined in the hardware description file. 

The maximum cycles over all other ports make up the overlapping 
portion. It is 18cy with the code above on a Sandy Bridge microar¬ 
chitecture, thus Tol = 9cy. 














4.5 Data traffic analysis 

The data traffic analysis is the central part of the Kerncraft tool, 
enabling most of the insight when analyzing streaming and sten¬ 
cil loop kernels. Its goal is to come up with predictions for the 
data volume from and into all memory hierarchy levels, so that 
the cycle counts required for the Roofline and ECM models can 
be determined. To this end we have implemented a simple “cache 
simulator” (see Sect. |4.1 [ above for the inherent assumptions about 
the hardware). In the following we explain its functionality using 
the 2D-5pt Jacobi example, since it is a basic but nontrivial case 
where temporal reuse of data is crucial for understanding perfor¬ 
mance properties. 

Each cache level is inspected independently, although only misses 
in lower levels (closer to registers) are passed to the next higher 
level. We start off with all data requests from the loop itera¬ 
tion i, j in the code from Listing as shown in Tables and 
a[j-l] [i], a[j] [i-l],b[j] [i], a[j] [i+1], a[j+l] [i]. Un¬ 
fortunately, this notation does not offer any information about the 
actual location of the values in memory, so we change to a ID- 
notation: a[(j-1)*N+i], a[j*N+i-l], b[j*N+i], a[j*N+i+l], 
a[(j + l)*N+i]. The mapping from 2D to ID emphasizes the 
streaming nature of the data accesses, but without N, i, and j, it 
is still not sufficient. The inner problem size N is specified by the 
user; we choose = 40 here. The indices i and j are kept abstract, 
but we assume a “loop center” at a relative offset of 0 in either 
direction. If we plug this back into the ID offsets, we are left with 
the following relative offsets: —40, —1, 0, +1, -f40. 

We now start adding single iterations backwards in the indices i 
and / until the cache size is exceeded. After each addition of new 
offsets, they are checked for overlaps with the original set of ac¬ 
cesses. If there is an overlap this is counted as a cache hit. Once 
the cache is full, it is easy to see that all offsets from the original set 
which were not turned into hits must be misses and thus contribute 
to cache or memory traffic. See Eigure|^for an illustration of all 
three cache levels. Note that, if there are many hits in the first level 
cache (LI), they are also hits in the second and last level, but since 
only misses lead to utilization of bandwidth, hits can be ignored. 
In our example we find one miss going all the way to memory: It 
is the first access to a new cache line in the outer loop direction 
(index +40, depicted in Fig.|^as a black filled square). Also, there 
are two misses on LI to the right of the loop center and above it, at 
offsets — 1 and +40, but they are hits on L2. This situation can be 
expressed in terms of the “layer condition” |18| : Three consecutive 
layers of the grid fit into the L2 and L3 caches, but not into the LI 
cache. However, the LI cache is certainly large enough to ensure 
that all accesses to the left (index — 1) will be LI hits. 

The actual communication between caches and memory is done in 
units of cache lines. Therefore the number of cache lines needs to 
be calculated from the offsets. We do not know any real memory 
address, so we arbitrarily decide that the first cache-line starts at 
offset 0. Fig. 1^ contains all information to predict data amounts 
originating from accesses to a[] between memory hierarchy lev¬ 
els: the all-black cell is a miss throughout all cache levels. The 
yellow cells are cached by LI and therefore constitute traffic be¬ 
tween registers and LI, which is encapsulated in TnOL given by 
lACA. The half-purple-half-black cells are misses in LI, but hits 
in L2 and therefore generate traffic between LI and L2 and add to 
7 Li-l 2 with two cache lines. Since everything was already caught 
by higher cache levels, L3 has only the one original miss to handle. 


In order to take care of write-allocates, all writes offsets are also 
treated as reads. All write offsets are added to an evict list and no 
caching is tracked on this, meaning that all writes are immediately 
evicted as long as they go to offsets in arrays. In our example, this 
leads to an additional cache line transfer for each cache level. 

4.6 Model construction 

Kerncraft currently supports six analysis modes: 

Roofline 

Roofline with a simple in-core model based on arithmetic 
peak performance and LI cache as an additional bandwidth 
bottleneck; does not require compiler or lACA 

RooflinelACA 

Roofline with lACA for in-core modeling 

ECM 

Full ECM model with lACA for in-core modeling 
ECMDat a 

Data transfer portion of ECM only; does not require compiler 
or lACA 

ECMCPU 

In-core (lACA-based) model only 
Benchmark 

Benchmark run for model validation 

In the following we describe how the models in the first five modes 
are put together. 

4.6.1 Roofline 

The Roofline model can be conshucted in two different ways, 
which are distinguished by the in-core modeling: either by us¬ 
ing lACA (Roof linelACA mode) or by calculating the theoretical 
(MULT-l-ADD) arithmetic peak performance of the CPU (Roof line 
mode). If the theoretical peak is used, the LI cache needs to be 
considered as another cache level and potential bandwidth bot¬ 
tleneck as described below. The peak performance is usually an 
over-optimistic estimate, because it assumes the perfect mix of op¬ 
erations, full SIMD vectorization, etc., for the given CPU. A more 
realistic model will replace this in future work. 

Each memory level (except the LI cache in Roof linelACA mode) 
is considered a potential data transfer bottleneck. As described in 
Sect. |2.2[ cycle counts for the calculated data volume are deter¬ 
mined based on microbenchmark results in the hardware descrip¬ 
tion, giving a minimum execution time for every memory hierarchy 
level. Several microbenchmarks are available to provide a “closest 
match” to the actual loop code: e.g., if one read stream, one write 
stream, and one write-allocate stream hit a certain memory level, 
the measured bandwidth of an array copy benchmark in that level 
is used as the bandwidth baseline. 

Among all lower execution time bounds for all memory levels and 
for the in-core execution, the largest is used as the Roofline model 
prediction (single-bottleneck view). Apart from the prediction in 
cycles per cache line, the tool also calculates the arithmetic inten¬ 
sity for the memory level that was found to be the bottleneck. Eor 
producing the output the cycle counts are by default converted to 
performance numbers in fiop/s. This can be changed via a com¬ 
mand line option (—unit cy/CL). In general, the units cy/CL, 
It/s, and FLOP/s can be selected. 


^ inner loop direction 




size of inner dimension (N) 


□ cached in second {L2) and last level cache miss in LI and hit in L2 miss on all cache levels 

I I cached in first (LI), second and last level cache cached only in last level cache loop center 


Figure 2: Cache usage prediction for the right-hand side array a [] on the 2D-5pt Jacobi kernel, with Af = 40 on a hypothetical machine with 
cache sizes that ensure the fulfillment of the layer condition in L3 and in L2, but not in the LI cache. 


4.6.2 ECM 

The ECM analysis mode is split into two sub-modes: ECMData 
for the data access and cache analysis, and ECMCPU for the in-core 
analysis performed by lACA. The model ECM combines both sub¬ 
models and gives an overall prediction. The reason for the split is 
that lACA may not be available on all systems, and it does not sup¬ 
port microarchitectures other than Intel’s. In such cases one may 
still use the ECMData model to get the data traffic analysis. 

Using the in-core prediction and the transfer times between adja¬ 
cent cache levels, the ECM model is put together and printed using 
the syntax as described in Sect. |2.3| By default the report is in 
cy/CL, but the standard options (see the previous section) are avail¬ 
able to change this. The tool also reports the number of cores at 
which the performance is expected to saturate. 

Examples for the application and the output of the tool can be found 
in Sect.|5]below. 

4.7 Model validation (Benchmark mode) 

The Benchmark mode of Kerncraft does not predict performance 
but measures it. Therefore this mode must be executed on the same 
machine as described in the hardware description file. A match¬ 
ing hardware description is necessary to determine the appropriate 
compiler flags, cache line sizes, and clock speed. For the measure¬ 
ment results to be generated correctly, the user must fix the CPU 
frequency to the clock speed defined in the machine file. 

In order to allow execution of the kernel code, it is transformed in 
a similar manner as for lACA, described in Section [UT] but with 
one extra for loop inserted that wraps all kernel loops. This is 
to prolong the total execution time for improved accuracy with 
small problem sizes. Also, calls to the LIKWID marker applica¬ 
tion programming interface are inserted before and after the outer 
f or loop for a precise measurement of the loop section only, with¬ 
out overhead from allocating and initializing memory. After suc¬ 
cessful transformation and compilation, the binary is executed with 
likwid-perf ctr, LIKWID’s performance counter and pinning 
command. Proper thread-core affinity is therefore ensured. The 


marker calls can also be used for advanced validation using data 
volume, cycles per instruction, and work-related metrics, but this 
feature is not leveraged in the current version of Kerncraft. The 
runtime is used to calculate the number of cycles per work unit. 

5. EVALUATION 

Here we evaluate the correctness of predictions derived by Kern¬ 
craft from kernel codes and hardware descriptions. It is out of the 
scope of this work to evaluate the underlying performance models; 
this has been discussed elsewhere (4l|l][8l[T8l|TTl|T0). We will, 
however, compare predictions by Kerncraft to predictions derived 
by manual analysis in previously published papers (see Table|^ and 
point out relevant differences and peculiarities. We only consider 
predictions for serial code on in-memory data sets here, although 
both models are certainly capable of providing in-cache and multi¬ 
threaded results as well (supported by the —cores option). The 
same binary code has been produced for the Sandy Bridge and the 
Haswell CPU, but lACA was run with the appropriate architectural 
setting. 

5.1 Stencil Codes 

We choose three interesting “comer case” stencils that have been 
comprehensively analyzed in |18| using the ECM model on a 
Sandy Bridge base microarchitecture: 2D-5pt Jacobi, UXX, and 
a fourth-order long-range stencil. 

5.1.1 2D-5pt Jacobi 

In Listing 1^ we show the Kerncraft command line arguments and 
generated output for the 2D 5-point Jacobi stencil (Listing for 
Sandy Bridge. In line number ?? we see the ECM model and line 
number ?? shows the ECM prediction notation. The results in Ta¬ 
ble]^ have a difference of 3cy in Tol on Sandy Bridge compared 
to the analysis in (TS) . This is due to the compiler using half-wide 
(16-byte) load instructions in our case. The half-wide loads take the 
same number of cycles as full-wide loads, but require two addresses 
to be generated. In combination with an additional address required 
for full-width store operations, this leads to an average of 9cy of 

































































Kernel Code 

Arch. 

Kerncraft 

Const. ECM Model EClvl^ RooflintP Bench. 

N (cy/CL) (cy/CL) (cy/CLF (cy/CL) 

Reference 

ECM Model RooflintP 

(cy/CL) (cy/CLF 

2D-5pt List. 

0 

SNB 

HSW 

{9.5||8|10|6|12.7} 36.7 29.8 36.4 

^ {9.4||8|5|6|16.7} 35.7 26.6 30.0 

{6||8|10|10|13} |[^ n/a 

n/a n/a 

UXX List. 

§ 

SNB 

HSW 

{84II 32.5 20120126.3} 98.8 84.0 112.5 

{56 27.5 10 20 31.6} 89.1 61.7 86.9 

{84II 38120120126} ii/a 

n/a n/a 

long-range List. 

0 

SNB 

HSW 

{57 II 53124124117.0} 118.0 65.9 134.2 

{57II 47.5112124122.3} 105.8 63.6 104.5 

{68 II641241241 17} n/a 

n/a n/a 

Kahan-dot List. 

0 

SNB 

HSW 

{96||8|4|4|7.8} 96.0 96.0 101.1 

{96 8 2 4 9.1} 96.0 96.0 98.0 

{32||8|4|4|7.9}|^(^ n/a 

n/a n/a 

Schonauer 

Triad List. 

0 

SNB 

HSW 

{4||6|10|10 21.9} 47.9 54.3 58.8 

{4||3|5| 10|26.3} 44.3 46.4 48.3 

{4||6| 10| 10|24} ii/a 

n/a n/a 


Table 5: Single-thread Kerncraft predictions for five benchmark kernels in comparison to reference predictions from previous publications. 

°in-memory prediction 
^without empirical penalties 


Listing 5: Kerncraft output for the analysis of the 2D-5pt Jacohi 
kernel on the Sandy Bridge microarchitecture (node name “phi- 
nally”) with N = 6000 and M = 6000 (output adapted for brevity) 

$ kerncraft -p ECM --cores 1 \ 

2 -m machine-flies/phinally.yaml \ 

kernels/2d-5pt.c \ 

4 -D N 6000 -D M 6000 

6 saturating at 3 cores 

8 $ kerncraft -p RooflinelACA \ 

—unit cy/CL --cores 1 \ 

10 -m machine - f iles/phinally . yaml \ 

kernels/2d-5pt.c \ 

12 -D N 6000 -D M 6000 -v 


14 Bottlnecks : 


level 1 

16 1 

ar.int . I perf or . | 

FLOP/b 1 cy/CL | 

bandw. I 
GB/s 1 

bw ke: 






18 CPU 1 

1 

9.5 1 

1 


L1-L2 1 

0.1 1 

16.9 1 

51.2 1 

triad 

20 L2-L3 1 

0.1 1 

27.5 1 

31.5 1 

triad 

L3-MEM 1 

0.17 1 

29.8 1 

17.4 1 

copy 

22 





Cache or 

mem bound 




24 

(bw with 

from copy 

benchmark) 

Arithmetic Intensity 

: 0.17 FLOP/b 



address generation per cache-line. The same happens on Haswell, 
although it has one additional address generation unit. However, 
this unit can only be utilized with simple addressing modes, and 
the compiler does not leverage this feature. On Haswell the ECM 
model is outperformed by the measurement by almost 20%. This 
is very unusual, although the model is not strictly optimistic like 
Roofline, and will he investigated. 

The Roofline model is much more optimistic than the ECM model 
for this code, since the memory bottleneck does not dominate the 
single-core execution time: The layer condition can only be satis¬ 
fied in the L2 cache for the chosen inner problem size {N = 6000). 
If spatial blocking for the LI cache is performed (or if the inner 
loop size is short enough). Roofline becomes more accurate as 
shown in |18|. 


Listing 6: Kernel code for the UXX stencil in double precision 

double ul[M][N][N], dl [M] [N] [N] , xx [M] [N] [N] ; 

double xy[M][N][N], xz[M][N][N], cl,c2,d,dth; 

forCint k=2; k<M-2; k++) { 

forCint j=2; j<N-2; j++) { 

forClnt 1=2; i<N-2; i++) { 

d = (dl [k-1] [j ] [i] + dl [k-1] [j-1] [i] 

+ dl[k][j][i] + dl [k] [ j-1] [i] ) *0.25 ; 

ul[k][j][i] = ul[k][j][i] + (dth/d) 

* ( cl*(xx[k] [j] [i] - xx[k] [j] [i-1]) 

+ c2* (xx [k] [ j ] [i + 1] - XX [k] [j ] [i-2] ) 

+ cl*(xy [k] [j+1] [i] - xy [k] [j-1] [i] ) 

+ c2* (xy [k] [j+1] [i] - xy [k] [ j-2] [ i] ) 

+ cl*(xz [k] [j] [i] - xz [k-1] [ j ] [i] ) 

+ c2*(xz [k + 1] [j] [i] - xz [k-2] [ j ] [i] ) ) ; 

}}} 


5.1.2 UXX (double precision) 

The UXX stencil (Listing has a long-latency divide operation 
in the inner loop that cannot be easily avoided and that dominates 
the overlapping time. This explains the large difference between 
Sandy Bridge and Haswell in terms of Tql- The discrepancy in 
TnOL compared to the reference result is due to the more current 
compiler, but there is also a deviation between Haswell and Sandy 
Bridge: The compiler generates a mixture of full-wide (32-byte) 
and half-wide (16-byte) load instructions. Since both architectures 
can execute two half-wide loads per cycle, hut only Haswell can 
do two full-wide loads per cycle, Haswell has a slight advantage in 
load throughput and, consequently, in TnOL- 

5.1.3 Long-range 

The fourth-order long-range stencil (Listing shows a very low 
fraction of execution time in TLSMem due to a massive amount 
of data (layers) that have to be supplied by the caches. The re¬ 
sults in Table 1^ exhibit a slight deviation in Tql of 16cy and in 
TnOL of 9cy on Sandy Bridge compared to published results. The 
lower cycle predictions in our analysis can be attributed to better 
compiler-generated code (fewer register spills). Again, Haswell 
has a slight advantage over Sandy Bridge in terms of InOL cycles 



















Listing 7: Kernel code for the long-range stencil 

double U[M][N][N], V[M][N][N], ROC [M] [N] [N] ; 

double cO, cl, c2, c3, c4, lap; 

fordnt k = 4; k < M-4; k + + ) { 
for(int j=4; j < N-4; j++) { 
fordnt i=4; i < N-4; i + + ) { 

lap = cO*V[k] [j] [i] 

+ cl * (V [k] [ j ] [i + 1] + V [k] [j ] [i-1] ) 

+ cl * (V [k] [ j+1] [i] + V [k] [j-1] [i]) 

+ cl * (V [k + 1] [ j ] [i] + V [k-1] [ j ] [i]) 

+ c2*(V [k] [j] [i + 2] + V[k] [j] [i-2] ) 

+ c2*(V[k] [j+2] [i] + V[k] [j-2] [i]) 

+ c2*(V[k + 2] [j] [i] + V[k-2] [j] [i]) 

+ c3 * (V [k] [ j ] [i+3] + V [k] [j ] [i-3] ) 

+ c3 * ( V [k] [ j+3] [ i] + V [k] [j-3] [i] ) 

+ c3 * ( V [k + 3] [ j ] [ i] + V [k-3] [ j ] [i] ) 

+ c4* (V [k] [ j ] [i+4] + V [k] [j ] [i-4] ) 

+ c4* ( V [k] [ j+4] [i] + V [k] [j-4] [i] ) 

+ c4* ( V [k + 4] [ j ] [i] + V [k-4] [ j ] [i] ) ; 

U[k][j][i] = 2.f*V[k] [j] [i] - U[k][j][i] 
+ R0C[k][j][i] * lap; }}} 


due to the compiler partially producing full-wide load instructions. 
The Roofline prediction is massively over-optimistic in all cases 
since it assumes a core-bound situation, whereas in fact the execu¬ 
tion time is spread almost evenly between in-core and data transfer 
contributions. 

The long-range stencil constitutes an instructive example for demon¬ 
strating the complicated situations that can arise with stencil codes 
on architectures with multiple caches. In Fig. we show graphi¬ 
cally the contributions to the ECM model in cycles per cache line 
versus the inner and middle loop size (both set to N). The entry 
in Table louses = 100 (marked in the figure). Varying N leads 
to layer conditions being fulfilled and violated in different cache 
levels; overall, six situations can be distinguished over the N range 
considered, shown at the bottom of the figure: for instance, around 
N ~ 1000 the layer condition in the outer (k) dimension is met in 
the L3 cache alone, whereas the L2 and LI caches are too small 
to even hold a sufficient number of rows to meet the layer condi¬ 
tion in the middle (j) dimension. Figure shows the predictions 
together with measured data (crosses). The general behavior with 
increasing N is well tracked by the model starting atN> 200, but 
there are considerable deviations for smaller N. This is expected 
since the steady-state assumption cannot be upheld with short inner 
loops, and boundary effects become non-negligible. 

5.2 Streaming Kernels 

Kerncraft does not only support the analysis of stencil codes, but 
also accepts typical streaming kernels found in numerical vector 
arithmetic and memory benchmarks. Here we will have a look 
at the Kahan-compensated double-precision dot product and the 
Schdnauer Triad. These have been analyzed thoroughly in GD and 
respectively. 
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Figure 3: Top: Single-core ECM model predictions for the 3D 
long-range stencil (Listing]^ with increasing inner (and middle) 
dimension N and a memory-bound data set. Bottom: Visualization 
of fulfilled layer conditions in the three cache levels. A 3D/2D/1D 
layer condition pertains to loop index k, j, and i, respectively. 
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Figure 4: Validation of the single-core ECM prediction for the 
3D long-range stencil (as presented in Fig.[^, with measurements 
(crosses) on SNB. 


Listing 8: Kernel code for Kahan-ddot 

double a[N] , b[N] , c; 
double sum, prod, t, y; 

fordnt i = 0; i<N; ++i) { 

prod = a[i] * b[i]; y = prod - c; 
t = sum + y; c = (t - sum) - y; sum = t; 

} 


5.2.1 Kahan-ddot 

The code is shown in Listing It provides a scalar product of 
two arrays a [] and b [], correcting for round-off errors due to 
the finite-precision floating-point number representation |14| . As 
was shown in GD- current compilers fail in generating efficient (or 
correct) machine code from the C source. In our case the com¬ 
piler could not use SIMD vectorization due to the presence of a 
loop-carried dependency, but it produced correct scalar code with¬ 
out further unrolling. As a consequence the runtime is dominated 
by the loop-carried dependency and I AC A reports Tql = 96cy/CL 
even in throughput mode. Here we find a case where the ECM and 
Roofline predictions coincide due to the strong dominance of Tol- 





























































The result for TnOL on Sandy Bridge in Table|^differs from the ref¬ 
erence result, since the latter was produced with a scalar but other¬ 
wise optimal version of the code that used modulo unrolling to hide 
the inter-iteration stalls. Note also that GD uses latency penalties 
to make the ECM model work better in memory. The Kerncraft 
tool has this capability (in fact, the penalty cycles are part of the 
machine files), but it is deactivated by default. 

5 . 2.2 Schonauer Triad 

The Schonauer Triad (Listing]^ is a simple streaming kernel which 
is designed to be limited by data transfers on all processor architec¬ 
tures even if the data resides in the innermost cache. 

The deviation of 7l3-mem compared to the reference results in Ta¬ 
ble]^ on Sandy Bridge is due to the application of a slightly dif¬ 
ferent peak memory bandwidth. It is striking that the ECM model 
is more optimistic than Roofline for this benchmark, but this only 
shows that the latter is based on a measured bandwidth value (with 
the very same microbenchmark kernel); on the other hand it is 
known that the ECM model is slightly optimistic with very tight, 
data-bound kernels, presumably because of hardware prefetchers 
not keeping up with the in-core execution in this case. Since the 
model assumes perfect prefetching, this deviation could be fixed 
by adding latency penalty cycles to the single-core execution. 

Listing 9: Kernel code for the Schonauer Triad 
double a[N] , b [N] , c [N] , d[N], s; 

forCint i=0; i<N; ++i) 

a[i] = b[i] + c[i] * d[i] ; 


6. RELATED WORK 

There are many performance modeling tools that rely on hardware 
metrics, statistical methods, curve fitting, and machine learning, 
but there are currently only four projects in the area of automatic 
analytic modeling that we know of: PBound, ExaSAT, Roofline 
Model Toolkit and MAQAO. 

Narayanan et al. GD describe a tool (PBound) for automatically 
extracting relevant information about execution resources (arith¬ 
metic operations, loads and stores) from source code. They do 
not, however, consider cache effects and parallel execution, and 
their machine model is rather idealized. Unat et al. (22| introduce 
the ExaSAT tool, which uses compiler technology for source code 
analysis. They also employ “layer conditions” jl8) to assess the 
real data traffic for every memory hierarchy level based on cache 
and problem sizes. They use an abstract simplified machine model, 
whereas our Kerncraft tool employs Intel lACA (see Sect. [13 to 
generate more accurate in-core predictions. On the one hand this 
restricts Kerncraft in-core predictions to Intel CPUs, but on the 
other hand provides predictions from the actual machine code con¬ 
taining all the optimizations provided by the compiler. Eurther- 
more, ExaSAT is restricted to the Roofline model for performance 
prediction. On the other hand, being compiler-based, ExaSAT sup¬ 
ports full-application modeling and code optimizations, which is 
out of scope in the current version of Kerncraft. It can also in¬ 
corporate communication (i.e., message passing) overhead. Lo et 
al. have recently introduced the “Roofline Model Toolkit,” 
which aims at automatically generating hardware descriptions for 
Roofline analysis. They do not support automatic topology detec¬ 
tion, however, and their use of compiler-generated loops introduces 
an element of uncertainty. Djoudi et al. Q started the MAQAO 


Project in 2005, which uses static analysis to predict in-core ex¬ 
ecution time and combines it with dynamic analysis to assess the 
overall code quality. It was originally developed for the Itanium 2 
but has since been adapted for recent Intel64 architectures and the 
Xeon Phi. As with Kerncraft, MAQAO currently supports only In¬ 
tel architectures. The memory access analysis is based on dynamic 
run-time data, i.e., it requires the code to be run on the target archi¬ 
tecture. 

There is to our knowledge no tool comparable to lACA for non- 
Intel CPUs. Tailback support for in-core predictions based on au¬ 
tomatic code analysis and architectural information (as done, e.g., 
in (22|) is work in progress. 


7. SUMMARY AND OUTLOOK 

The power and utility of analytic performance modeling is undis¬ 
puted, but the construction of accurate models is a tedious task 
that requires considerable time and experience. The Kerncraft tool 
enables analytic (Roofline or ECM) modeling of loop kernels on 
modem CPU architectures under well-defined conditions. We have 
shown that predictions generated by Kerncraft concur with pub¬ 
lished numbers on a variety of codes and two modern Intel archi¬ 
tectures. However, it must be emphasized that Kerncraft is not a 
“black-box” tool that can be applied blindly without background 
knowledge. In contrast, it is the deviation of generated models from 
experience or from measured results that sparks new insight. This 
is impossible without a basic understanding about the software and 
the capabilities of the hardware it is mnning on. 

Currently the applicability of the tool is restricted by dependencies 
on the Intel-specific lACA tool and compilers. But even without 
those Kerncraft can still predict data locality and cache behavior. 
Since the basic principles of the modeling process implemented in 
Kerncraft are widely applicable, support for other architectures is 
a matter of extending the ECM model appropriately. Currently we 
are working on adapting the model to ARM-based chips. To further 
enhance the tool in the short term we will support single precision 
floating-point numbers, passing pragma directives to the compiler, 
and allowing for other compilers than ICC. On the ECM model 
side, non-inclusive cache hierarchies and non-temporal stores need 
further evaluation but will be implemented as well. It will be a 
more difficult task to find or build a suitable replacement for lACA 
and to predict the data transfer properties of non-contiguous access 
patterns. The simplistic cache model in Kerncraft may be extended 
to non-LRU cache replacement policies and non-fully associative 
caches, but we consider such additions less crucial. 

To combine dynamic analysis with performance modeling, we plan 
to introduce phenomenological models where data traffic and band¬ 
width measurements are used to improve the validation procedure 
and to capture effects that can not be easily modeled, such as irreg¬ 
ular memory accesses as found, e.g., in sparse-matrix algorithms. 
The required infrastructure to take the necessary data is already 
functional in the current version of Kerncraft. 
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